@metalcycling
Collaborator

This PR prevents the requeuing mechanism from requeuing an AppWrapper at the beginning of execution just because its pods take a long time to reach the Running state. This has been observed when pulling images is slow or when init containers take a long time to run. The waiting time now grows exponentially, so pods that are slow to start eventually get enough time to become ready. Users who want to set the initial waiting time themselves can use the requeuingTimeMinutes field, which is exposed in the schedulingSpec stanza. The default is 5 minutes which, in our experience, is enough for most applications. The updated code also measures the requeuing time from the last condition rather than from when the AppWrapper was Dispatched.

… to specify how long to wait for the first requeuing period after dispatching.
…ntially. This time is set by default to 5 minutes but can be modified from the AppWrapper schedulingSpec field.
…m with user-supplied initial time. This test uses init containers to trick the requeuing mechanism into thinking the pods have failed and are not going to complete.
Member

@asm582 asm582 left a comment

Nice work, can you please address one comment?

asm582 previously approved these changes Feb 14, 2023
@metalcycling metalcycling merged commit f52b39c into project-codeflare:main Feb 14, 2023